NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Safety Interventions against Adversarial Patches in an Open-Source Driver Assistance System

Chen, Cheng; Xiao, Grant; Lee, Daehyun; Yang, Lishan; Smirni, Evgenia; Alemzadeh, Homa; Zhou, Xugui (June 2025, IEEE)

Free, publicly-accessible full text available June 28, 2026
Aspis: Lightweight Neural Network Protection Against Soft Errors

https://doi.org/10.1109/ISSRE62328.2024.00036

Schmedding, Anna; Yang, Lishan; Jog, Adwait; Smirni, Evgenia (October 2024, IEEE)

Convolutional neural networks (CNN) are incorporated into many image-based tasks across a variety of domains. Some of these are safety critical tasks such as object classification/detection and lane detection for self-driving cars. These applications have strict safety requirements and must guarantee the reliable operation of the neural networks in the presence of soft errors (i.e., transient faults) in DRAM. Standard safety mechanisms (e.g., triplication of data/computation) provide high resilience, but introduce intolerable overhead. We perform detailed characterization and propose an efficient methodology for pinpointing critical weights by using an efficient proxy, the Taylor criterion. Using this characterization, we design Aspis, an efficient software protection scheme that does selective weight hardening and offers a performance/reliability tradeoff. Aspis provides higher resilience comparing to state-of-the-art methods and is integrated into PyTorch as a fully-automated library.
more » « less
Full Text Available
TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations

https://doi.org/10.1145/3721145.3725774

Guan, Jiexiong; Hu, Zhenqing; Antonopoulos, Christos D; Bellas, Nikolaos; Lalis, Spyros; Smirni, Evgenia; Zhou, Gang; Agrawal, Gagan; Ren, Bin (June 2025, ACM)

The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile system-on-a-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (e.g., Qualcomm Adreno GPUs) usually have a 2.5D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a non-well-documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern(s), tiling size(s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on a mobile GPU, and compared against both popular mobile DNN frameworks and another GPU performance model. Evaluation results demonstrate that TMModel outperforms all baselines, achieving 1.48 − 3.61× speedup on individual kernels and 1.83 − 66.1× speedup for end-to-end on-device training with only 0.25% − 18.5% the tuning cost of the baselines.
more » « less
Free, publicly-accessible full text available June 8, 2026
GPU Reliability Assessment: Insights Across the Abstraction Layers

https://doi.org/10.1109/CLUSTER59578.2024.00008

Yang, Lishan; Papadimitriou, George; Sartzetakis, Dimitris; Jog, Adwait; Smirni, Evgenia; Gizopoulos, Dimitris (September 2024, IEEE)

Graphics Processing Units (GPUs) are widely de-ployed and utilized across various computing domains including cloud and high-performance computing. Considering its extensive usage and increasing popularity, ensuring GPU reliability is cru-cial. Software-based reliability evaluation methodologies, though fast, often neglect the complex hardware details of modern GPU designs. This oversight could lead to misleading measurements and misguided decisions regarding protection strategies. This paper breaks new ground by conducting an in-depth examination of well-established vulnerability assessment methods for modern GPU architectures, from the microarchitecture all the way to the software layers. It highlights divergences between popular software-based vulnerability evaluation methods and the ground truth cross-layer evaluation, which persist even under strong protections like triple modular redundancy. Accurate evaluation requires considering fault distribution from hardware to software. Our comprehensive measurements offer valuable insights into the accurate assessment of GPU reliability.
more » « less
Full Text Available
Strategic Resilience Evaluation of Neural Networks Within Autonomous VehicleSoftwareAnna Schmedding, Philip Schowitz, Xugui Zhou, Yiyang Lu, Lishan Yang, Homa Alemzadeh, and Evgenia Smirni

Schmedding, Anna; Schowitz, Philip; Zhou, Xugui; Lu, Yiyang; Yang, Lishan; Alemzadeh, Homa; Smirni, Evgenia (September 2024, Lecture Notes in Computer Science 14988, 43rd International Conference on Computer Safety, Reliability, and SecurityC, SAFECOMP 2024)
Cecarelli, Andrea; Trapp, Mario; Bondavalli, Andrea; Bitsch, Friedemann (Ed.)
Simulation-basedFaultInjection(FI)ishighlyrecommended by functional safety standards in the automotive and aerospace domains, in order to “support the argumentation of completeness and correctness of a system architectural design with respect to faults” (ISO 26262). We argue that a library of failure models facilitates this process. Such a library, firstly, supports completeness claims through, e.g., an extensive and systematic collection process. Secondly, we argue why failure model specifications should be executable—to be implemented as FI operators within a simulation framework—and parametrizable—to be relevant and accurate for different systems. Given the distributed nature of automo- tive and aerospace development processes, we moreover argue that a data-flow-based definition allows failure models to be applied to black- box components. Yet, existing sources for failure models provide frag- mented, ambiguous, incomplete, and redundant information, often meet- ing neither of the above requirements. We therefore introduce a library of 18 executable and parameterizable failure models collected with a sys- tematic literature survey focusing on automotive and aerospace Cyber- Physical Systems (CPS). To demonstrate the applicability to simulation- based FI, we implement and apply a selection of failure models to a real- world automotive CPS within a state-of-the-art simulation environment, and highlight their impact.
more » « less
Full Text Available
Strategic Resilience Evaluation of Neural Networks Within Autonomous Vehicle Software

https://doi.org/10.1007/978-3-031-68606-1_3

Schmedding, Anna; Schowitz, Philip; Zhou, Xugui; Lu, Yiyang; Yang, Lishan; Alemzadeh, Homa; Smirni, Evgenia (September 2024, Springer Nature Switzerland)

Full Text Available
Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study

https://doi.org/10.1145/3650200.3656615

Oles, Vladyslav; Schmedding, Anna; Ostrouchov, George; Shin, Woong; Smirni, Evgenia; Engelmann, Christian (May 2024, ACM)

GPU memory corruption and in particular double-bit errors (DBEs) remain one of the least understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to job termination and can potentially cost thousands of node-hours, either from wasted com- putations or as the overhead from regular checkpointing needed to minimize the losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environ- mental footprint, the eciency of HPC operations becomes both an imperative and a challenge. We examine DBEs using system telemetry data and logs col- lected from the Summit supercomputer, equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We nd that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. On the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no signifcant correlation with higher temperatures. We also show that the workload type can be a factor in memory’s propensity to corruption.
more » « less
Full Text Available
Epidemic Spread Modeling for COVID-19 Using Cross-Fertilization of Mobility Data

https://doi.org/10.1109/TBDATA.2023.3248650

Schmedding, Anna; Pinciroli, Riccardo; Yang, Lishan; Smirni, Evgenia (October 2023, IEEE Transactions on Big Data)

Full Text Available
Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models

https://doi.org/10.1109/TDSC.2021.3131571

Pinciroli, Riccardo; Yang, Lishan; Alter, Jacob; Smirni, Evgenia (January 2023, IEEE Transactions on Dependable and Secure Computing)

Data center downtime typically centers around IT equipment failure. Storage devices are the most frequently failing components in data centers. We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that constitute the typical storage in data centers. Using six-year field data of 100,000 HDDs of different models from the same manufacturer from the Backblaze dataset and six-year field data of 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that their root failure causes differ from common expectations and that they remain difficult to discern. For the case of HDDs we observe that young and old drives do not present many differences in their failures. Instead, failures may be distinguished by discriminating drives based on the time spent for head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that are shown to be surprisingly accurate, achieving high recall and low false positive rates. These models are used beyond simple prediction as they aid us to untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
more » « less
Full Text Available
GeoSpread: an Epidemic Spread Modeling Tool for COVID-19 Using Mobility Data

https://doi.org/10.1145/3524458.3547257

Schmedding, Anna; Yang, Lishan; Pinciroli, Riccardo; Smirni, Evgenia (September 2022, GoodIT 2022: {ACM} International Conference on Information Technology for Social Good, Limassol, Cyprus, September 7 - 9, 2022)
Mourlas, cotas; Pacheco, Diego; Pandi, Catia (Ed.)
We present an individual-centric agent-based model and a flexible tool, GeoSpread, for studying and predicting the spread of viruses and diseases in urban settings. Using COVID-19 data collected by the Korean Center for Disease Control & Prevention (KCDC), we analyze patient and route data of infected people from January 20, 2020, to May 31, 2020, and discover how infection clusters develop as a function of time. This analysis offers a statistical characterization of population mobility and is used to parameterize GeoSpread to capture the spread of the disease. We validate simulation predictions from GeoSpread with ground truth and we evaluate different what-if counter-measure scenarios to illustrate the usefulness and flexibility of the tool for epidemic modeling.
more » « less
Full Text Available

« Prev Next »

Search for: All records